This report uses machine learning to predict whether a home was built before 1980 and estimate its exact construction year. Four classification models were tested—Random Forest, Gradient Boosting, Decision Tree, and Gaussian Naive Bayes. Gradient Boosting achieved the highest accuracy at 94% with the best F1 score, making it the most effective. Key predictors included architectural style, number of bathrooms, and garage type.
After merging neighborhood-level features into the dataset, Gradient Boosting maintained its strong performance, showing its adaptability. For regression, it also outperformed other models with an R^2 of 0.87 and a mean absolute error of 7.4 years. This confirms its reliability for estimating home age.
This project highlights practical skills in data preparation, model tuning, and performance evaluation, with results showing Gradient Boosting as the top choice for both classification and regression tasks in housing data.
```python
# Split dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=70)
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y, test_size=0.25, random_state=70)
x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y, test_size=0.25, random_state=70)
```
Model Training:
```python
# Train Random Forest Classifier
classifier_DT = RandomForestClassifier(
    n_estimators=175,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=70
)
classifier_DT.fit(x_train, y_train)

# Train Gradient Boosting Classifier
classifier_DT2 = GradientBoostingClassifier(
    random_state=70,
    learning_rate=0.05,
    max_depth=12,   # Sweet spot, don't change
    n_estimators=175,
    subsample=0.8
)
classifier_DT2.fit(x2_train, y2_train)

# Train Decision Tree Classifier
classifier_DT3 = DecisionTreeClassifier(
    max_depth=15,          # limits tree depth to prevent overfitting
    min_samples_split=20,  # must have 20 samples to split a node
    min_samples_leaf=10,   # each leaf must have at least 10 samples
    max_features='sqrt',   # use sqrt(n_features) at each split
    random_state=70
)
classifier_DT3.fit(x3_train, y3_train)

# Train Gaussian NB Classifier
classifier_DT4 = GaussianNB()
classifier_DT4.fit(x_train, y_train)
```
Below, I will display the top 7 features for each model to examine what distinguishes each one.
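The top-7 lists above can be produced directly from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data and illustrative feature names (not the report's housing columns) to show the pattern:

```python
# Sketch: pulling the top 7 feature importances from a fitted model.
# Data and feature names here are synthetic stand-ins for the housing dataset.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(70)
feature_names = [f"feat_{i}" for i in range(10)]
X = pd.DataFrame(rng.normal(size=(200, 10)), columns=feature_names)
y = (X["feat_0"] + X["feat_1"] > 0).astype(int)  # toy target

model = RandomForestClassifier(n_estimators=50, random_state=70).fit(X, y)

# Importances sum to 1, so each value reads directly as a percentage share
top7 = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
    .head(7)
)
print(top7)
```

The same call works unchanged for `GradientBoostingClassifier` and `DecisionTreeClassifier`, which is how the three models' rankings can be compared side by side.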
Random Forest: The top 7 important features of the Random Forest model all fall within a tight range of 6%-12%. Arcstyle One-Story tops the list, with Stories close behind in second. Interestingly, number of baths ranked 5th in importance at 8%.
Gradient Boosting: The top 7 important features of the Gradient Boosting model are more spread out. Arcstyle One-Story again tops the list, at a whopping 23.7%! The remaining six features fall between 3% and 15%. Stories had only 3% importance in this model, compared to 11% in the Random Forest model.
Decision Tree: The top 7 important features of the Decision Tree model are heavily concentrated in two: Quality Type C (21%) and Living Area (20%). Interestingly, Arcstyle One-Story does not make this top 7 list, but Arcstyle Two-Story does.
One thing to note from the top 7 features of each model: the Decision Tree is heavily weighted toward two features, possibly overfitting to them. The Gradient Boosting model has one dominant feature, but is not as skewed as the Decision Tree. Random Forest appears to capture a more multilayered relationship, as its importances are spread more evenly. So far I am leaning toward the Random Forest model.
Model Accuracies and Performance
Below, I will display the accuracy, recall, precision, and F1 scores for each model.
Gaussian NB performs poorly, with an accuracy of only 66%.
Based on these scores, Gradient Boosting is also the most accurate, even when accounting for false positives and false negatives. Its weighted F1 score of 94% bested Random Forest's 92%. Given our context of predicting whether a home was built in 1980 or before, and the concern with asbestos, I was particularly curious about the recall scores of these two classifiers. Gradient Boosting still bests Random Forest in each recall score by about 3%.
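The metrics discussed above come from standard scikit-learn scoring functions. A minimal sketch with toy labels (not the report's actual predictions) shows how accuracy, recall, and the weighted F1 score are computed:

```python
# Sketch: classification metrics with scikit-learn.
# y_true/y_pred are toy labels, purely illustrative of the calculation.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)                      # fraction correct
prec = precision_score(y_true, y_pred)                    # TP / (TP + FP)
rec = recall_score(y_true, y_pred)                        # TP / (TP + FN)
f1w = f1_score(y_true, y_pred, average="weighted")        # support-weighted F1

print(f"accuracy={acc:.2f} precision={prec:.3f} recall={rec:.3f} weighted_f1={f1w:.2f}")
```

Recall is the metric to watch here: in the asbestos context, a false negative (predicting a pre-1980 home as newer) is the costly mistake, and recall directly measures how many true pre-1980 homes are caught.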
New Data?
To determine which model is best, let's add new data to our set and see whether the models' scores change. I will analyze only the Random Forest and Gradient Boosting classifiers, as they were the top performers in my previous tests, with the closest scores.
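Bringing neighborhood-level features into the house-level table is a left merge on a shared key. The sketch below uses hypothetical column names (`parcel`, `nbhd`, `nbhd_avg_year`); the actual keys and features depend on the dataset used in this report:

```python
# Sketch: merging neighborhood-level features into house-level rows.
# Column names are illustrative, not the report's actual schema.
import pandas as pd

homes = pd.DataFrame({
    "parcel": ["A1", "A2", "A3"],
    "nbhd": [1, 2, 1],
    "livearea": [1200, 2400, 1800],
})
nbhd_stats = pd.DataFrame({
    "nbhd": [1, 2],
    "nbhd_avg_year": [1975, 1992],  # e.g. average build year per neighborhood
})

# Left merge keeps every house row and attaches its neighborhood's features
new_data = homes.merge(nbhd_stats, on="nbhd", how="left")
print(new_data)
```

A left merge is the safe choice here: every house keeps its row even if a neighborhood is missing from the lookup table (those features simply come through as NaN).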
With the new data, Gradient Boosting took a major leap forward, improving in every category by 2-3%! My recommendation of the Gradient Boosting Classifier stands, especially with this merged dataset.
Regressors
To end the report, I will look not only at predicting whether a home was built before 1980, but at predicting the actual year the house was built. I will consider only two models: Gradient Boosting Regressor and Random Forest Regressor.
```python
x = new_data.drop(['before1980', 'parcel', 'yrbuilt'], axis=1)
y = new_data['yrbuilt']

# Split dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=70)

regressor_RF = RandomForestRegressor(
    n_estimators=175,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=70
)
regressor_RF.fit(x_train, y_train)

regressor_GB = GradientBoostingRegressor(
    random_state=70,
    learning_rate=0.05,
    max_depth=12,   # Sweet spot, don't change
    n_estimators=175,
    subsample=0.8
)
regressor_GB.fit(x_train, y_train)
```
Based on the regression metrics, Gradient Boosting clearly outperforms Random Forest in predicting the exact year a home was built. Gradient Boosting achieved an R^2 score of 0.87, indicating that it explains approximately 87% of the variance in construction year, compared to 76% with Random Forest. Additionally, it produced a much lower Mean Absolute Error (MAE) of 7.4 years versus 12.8 years for Random Forest. This means on average, Gradient Boosting’s predictions are over 5 years closer to the true year. These results confirm that Gradient Boosting is the better regressor for this task, providing more accurate and reliable predictions for clients interested in estimating home age.
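The R^2 and MAE figures above come from scikit-learn's regression metrics. A minimal sketch with toy build years (not the report's actual predictions) shows the calculation:

```python
# Sketch: regression metrics with scikit-learn.
# y_true/y_pred are toy build years, purely illustrative.
from sklearn.metrics import r2_score, mean_absolute_error

y_true = [1950, 1975, 1982, 1990, 2001]
y_pred = [1955, 1970, 1985, 1988, 2004]

r2 = r2_score(y_true, y_pred)               # fraction of variance explained
mae = mean_absolute_error(y_true, y_pred)   # average error, in years

print(f"R^2={r2:.2f} MAE={mae:.1f} years")
```

MAE is the more intuitive number for clients: it answers "on average, how many years off is the prediction?", which is why the 7.4-year vs 12.8-year gap is the headline comparison.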